Linguistic Features
Processing raw text intelligently is difficult: most words are rare, and it’s
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it’s possible to solve some problems starting from only the raw
characters, it’s usually better to use linguistic knowledge to add useful
information. That’s exactly what spaCy is designed to do: you put in raw text,
and get back a Doc
object that comes with a variety of
annotations.
Part-of-speech tagging Needs model
After tokenization, spaCy can parse and tag a given Doc
. This is where
the trained pipeline and its statistical models come in, which enable spaCy to
make predictions of which tag or label most likely applies in this context.
A trained component includes binary data that is produced by showing a system
enough examples for it to make predictions that generalize across the language –
for example, a word following “the” in English is most likely a noun.
Linguistic annotations are available as
Token
attributes. Like many NLP libraries, spaCy
encodes all strings to hash values to reduce memory usage and improve
efficiency. So to get the readable string representation of an attribute, we
need to add an underscore _
to its name:
Editable Code
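For reference, the results in the table below can be reproduced with a short snippet along these lines (assuming the small English pipeline en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # The underscore variants return readable strings instead of hash values
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
```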
Text | Lemma | POS | Tag | Dep | Shape | Is alpha | Is stop |
---|---|---|---|---|---|---|---|
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False |
is | be | AUX | VBZ | aux | xx | True | True |
looking | look | VERB | VBG | ROOT | xxxx | True | False |
at | at | ADP | IN | prep | xx | True | True |
buying | buy | VERB | VBG | pcomp | xxxx | True | False |
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False |
startup | startup | NOUN | NN | dobj | xxxx | True | False |
for | for | ADP | IN | prep | xxx | True | True |
$ | $ | SYM | $ | quantmod | $ | False | False |
1 | 1 | NUM | CD | compound | d | False | False |
billion | billion | NUM | CD | pobj | xxxx | True | False |
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its dependencies look like:
Morphology
Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:
Context | Surface | Lemma | POS | Morphological Features |
---|---|---|---|---|
I was reading the paper | reading | read | VERB | VerbForm=Ger |
I don’t watch the news, I read the paper | read | read | VERB | VerbForm=Fin , Mood=Ind , Tense=Pres |
I read the paper yesterday | read | read | VERB | VerbForm=Fin , Mood=Ind , Tense=Past |
Morphological features are stored in the
MorphAnalysis
under Token.morph
, which
allows you to access individual morphological features.
Editable Code
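For example, with a trained English pipeline such as en_core_web_sm (assumed to be installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # "I"
print(token.morph)                  # e.g. Case=Nom|Number=Sing|Person=1|PronType=Prs
print(token.morph.get("PronType"))  # e.g. ['Prs']
```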
Statistical morphology v3.0Needs model
spaCy’s statistical Morphologizer
component assigns the
morphological features and coarse-grained part-of-speech tags as Token.morph
and Token.pos
.
Editable Code
Rule-based morphology
For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and fine-grained part-of-speech tags to produce coarse-grained part-of-speech tags and morphological features.
- The part-of-speech tagger assigns each token a fine-grained part-of-speech tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. verb) and some amount of morphological information, e.g. that the verb is past tense (e.g. VBD for a past tense verb in the Penn Treebank).
- For words whose coarse-grained POS is not set by a prior process, a mapping table maps the fine-grained tags to coarse-grained POS tags and morphological features.
Editable Code
Lemmatization v3.0
spaCy provides two pipeline components for lemmatization:
- The Lemmatizer component provides lookup and rule-based lemmatization methods in a configurable component. An individual language can extend the Lemmatizer as part of its language data.
- The EditTreeLemmatizer v3.3 component provides a trainable lemmatizer.
Editable Code
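As a minimal sketch, the provided English pipelines ship with a rule-based lemmatizer whose output can be inspected like this (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule' for the provided English pipelines

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
```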
The data for spaCy’s lemmatizers is distributed in the package
spacy-lookups-data
. The
provided trained pipelines already include all the required tables, but if you
are creating new pipelines, you’ll probably want to install spacy-lookups-data
to provide the data when the lemmatizer is initialized.
Lookup lemmatizer
For pipelines without a tagger or morphologizer, a lookup lemmatizer can be
added to the pipeline as long as a lookup table is provided, typically through
spacy-lookups-data
. The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token’s part-of-speech or context.
Rule-based lemmatizer Needs model
When training pipelines that include a component that assigns part-of-speech
tags (a morphologizer or a tagger with a POS mapping), a
rule-based lemmatizer can be added using rule tables from
spacy-lookups-data
:
The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from WordNet.
Trainable lemmatizer Needs model
The EditTreeLemmatizer
can learn form-to-lemma
transformations from a training corpus that includes lemma annotations. This
removes the need to write language-specific rules and can (in many cases)
provide higher accuracies than lookup and rule-based lemmatizers.
Dependency Parsing Needs model
spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or “chunks”. You can
check whether a Doc
object has been parsed by calling
doc.has_annotation("DEP")
, which checks whether the attribute Token.dep
has been set, and returns a boolean value. If the result is False
, the default sentence
iterator will raise an exception.
Noun chunks
Noun chunks are “base noun phrases” – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, “the lavish green grass” or “the world’s largest tech fund”. To
get the noun chunks in a document, simply iterate over
Doc.noun_chunks
.
Editable Code
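For instance, the table below can be produced with a loop like this (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
```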
Text | root.text | root.dep_ | root.head.text |
---|---|---|---|
Autonomous cars | cars | nsubj | shift |
insurance liability | liability | dobj | shift |
manufacturers | manufacturers | pobj | toward |
Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by
a single arc in the dependency tree. The term dep is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of .dep
is a hash value. You can
get the string value with .dep_
.
Editable Code
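A sketch of the kind of loop that produces the table below (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
```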
Text | Dep | Head text | Head POS | Children |
---|---|---|---|---|
Autonomous | amod | cars | NOUN | |
cars | nsubj | shift | VERB | Autonomous |
shift | ROOT | shift | VERB | cars, liability, toward |
insurance | compound | liability | NOUN | |
liability | dobj | shift | VERB | insurance |
toward | prep | shift | VERB | manufacturers |
manufacturers | pobj | toward | ADP | |
Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:
Editable Code
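For example, a sketch that collects all verbs that have a nominal subject by iterating over the possible subjects (assuming en_core_web_sm):

```python
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Match from below: iterate over possible subjects and check their heads
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)  # {shift}
```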
If you try to match from above, you’ll have to iterate twice. Once for the head, and then again through the children:
To iterate through the children, use the token.children
attribute, which
provides a sequence of Token
objects.
Iterating around the local tree
A few more convenience attributes are provided for iterating around the local
tree from the token. Token.lefts
and
Token.rights
attributes provide sequences of syntactic
children that occur before and after the token. Both sequences are in sentence
order. There are also two integer-typed attributes,
Token.n_lefts
and
Token.n_rights
that give the number of left and right
children.
Editable Code
Editable Code
You can get a whole phrase by its syntactic head using the
Token.subtree
attribute. This returns an ordered
sequence of tokens. You can walk up the tree with the
Token.ancestors
attribute, and check dominance with
Token.is_ancestor.
Editable Code
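For example, the table below can be produced by walking the subtree of the sentence's subject (a sketch, assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

# The root is the only token that is its own head; its first left child
# is the subject of the sentence
root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
          descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])
```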
Text | Dep | n_lefts | n_rights | ancestors |
---|---|---|---|---|
Credit | nmod | 0 | 2 | holders, submit |
and | cc | 0 | 0 | holders, submit |
mortgage | compound | 0 | 0 | account, Credit, holders, submit |
account | conj | 1 | 0 | Credit, holders, submit |
holders | nsubj | 1 | 0 | submit |
Finally, the .left_edge
and .right_edge
attributes can be especially useful,
because they give you the first and last token of the subtree. This is the
easiest way to create a Span
object for a syntactic phrase. Note that
.right_edge
gives a token within the subtree – so if you use it as the
end-point of a range, don’t forget to +1
!
Editable Code
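One way to arrive at the table below is to merge the subject phrase into a single token using its left and right edges (a sketch, assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

# doc[4] is "holders" – take the span from its left edge to its right edge
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```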
Text | POS | Dep | Head text |
---|---|---|---|
Credit and mortgage account holders | NOUN | nsubj | submit |
must | VERB | aux | submit |
submit | VERB | ROOT | submit |
their | ADJ | poss | requests |
requests | NOUN | dobj | submit |
The dependency parse can be a useful tool for information extraction,
especially when combined with other predictions like
named entities. The following example extracts money and
currency values, i.e. entities labeled as MONEY
, and then uses the dependency
parse to find the noun phrase they are referring to – for example "Net income"
→ "$9.4 million"
.
Editable Code
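A sketch of such an extraction rule might look like this (assuming en_core_web_sm; the built-in merge components just make the dependency patterns easier to write, and the example texts are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge entities and noun chunks into single tokens to simplify the patterns
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # Attribute or direct object: look for a nominal subject of the head
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # Prepositional object: report the head of the preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)
```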
Visualizing dependencies
The best way to understand spaCy’s dependency parser is interactively. To make
this easier, spaCy comes with a visualization module. You can pass a Doc
or a
list of Doc
objects to displaCy and run
displacy.serve
to run the web server, or
displacy.render
to generate the raw markup.
If you want to know how to write rules that hook into some type of syntactic
construction, just plug the sentence into the visualizer and see how spaCy
annotates it.
Editable Code
Disabling the parser
In the trained pipelines provided by spaCy, the parser is loaded and
enabled by default as part of the
standard processing pipeline. If you don’t need
any of the syntactic information, you should disable the parser. Disabling the
parser will make spaCy load and run much faster. If you want to load the parser,
but need to disable it for specific documents, you can also control its use on
the nlp
object. For more details, see the usage guide on
disabling pipeline components.
Named Entity Recognition
spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.
Named Entity Recognition 101
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents
property of a Doc
:
Editable Code
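For example (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```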
Text | Start | End | Label | Description |
---|---|---|---|---|
Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
$1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:
Accessing entity annotations and labels
The standard way to access entity annotations is the doc.ents
property, which produces a sequence of Span
objects. The entity
type is accessible either as a hash value or as a string, using the attributes
ent.label
and ent.label_
. The Span
object acts as a sequence of tokens, so
you can iterate over the entity or index into it. You can also get the text form
of the whole entity, as though it were a single token.
You can also access token entity annotations using the
token.ent_iob
and
token.ent_type
attributes. token.ent_iob
indicates
whether an entity starts, continues or ends on the tag. If no entity type is set
on a token, it will return an empty string.
Editable Code
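For example, the table below corresponds to output along these lines (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Document-level entity annotations
print([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])

# Token-level entity annotations
print([doc[0].text, doc[0].ent_iob_, doc[0].ent_type_])  # ['San', 'B', 'GPE']
print([doc[1].text, doc[1].ent_iob_, doc[1].ent_type_])  # ['Francisco', 'I', 'GPE']
```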
Text | ent_iob | ent_iob_ | ent_type_ | Description |
---|---|---|---|---|
San | 3 | B | "GPE" | beginning of an entity |
Francisco | 1 | I | "GPE" | inside an entity |
considers | 2 | O | "" | outside an entity |
banning | 2 | O | "" | outside an entity |
sidewalk | 2 | O | "" | outside an entity |
delivery | 2 | O | "" | outside an entity |
robots | 2 | O | "" | outside an entity |
Setting entity annotations
To ensure that the sequence of token annotations remains consistent, you have to
set entity annotations at the document level. However, you can’t write
directly to the token.ent_iob
or token.ent_type
attributes, so the easiest
way to set entities is to use the doc.set_ents
function
and create the new entity as a Span
.
Editable Code
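A sketch of both options (assuming en_core_web_sm and that the model does not already tag “fb” as an entity):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
print("Before", [(ent.text, ent.label_) for ent in doc.ents])

# Create a span for the new entity over tokens 0-1 and keep the original ents
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Option 1: modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")

# Option 2: assign a complete list of entities to doc.ents
doc.ents = orig_ents + [fb_ent]

print("After", [(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])
```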
Keep in mind that Span
is initialized with the start and end token
indices, not the character offsets. To create a span from character offsets, use
Doc.char_span
:
Setting entity annotations from array
You can also assign entity annotations using the
doc.from_array
method. To do this, you should include
both the ENT_TYPE
and the ENT_IOB
attributes in the array you’re importing
from.
Editable Code
Setting entity annotations in Cython
Finally, you can always write to the underlying struct if you compile a Cython function. This is easy to do, and allows you to write efficient native code.
Obviously, if you write directly to the array of TokenC*
structs, you’ll have
responsibility for ensuring that the data is left in a consistent state.
Built-in entity types
Visualizing named entities
The
displaCy ENT visualizer
lets you explore an entity recognition model’s behavior interactively. If you’re
training a model, it’s very useful to run the visualization yourself. To help
you do that, spaCy comes with a visualization module. You can pass a Doc
or a
list of Doc
objects to displaCy and run
displacy.serve
to run the web server, or
displacy.render
to generate the raw markup.
For more details and examples, see the usage guide on visualizing spaCy.
Named Entity example
Entity Linking
To ground the named entities into the “real world”, spaCy provides functionality
to perform entity linking, which resolves a textual entity to a unique
identifier from a knowledge base (KB). You can create your own
KnowledgeBase
and train a new
EntityLinker
using that custom knowledge base.
As an example of how to define a KnowledgeBase and train an entity linker model,
see this tutorial
using spaCy projects.
Accessing entity identifiers Needs model
The annotated KB identifier is accessible as either a hash value or as a string,
using the attributes ent.kb_id
and ent.kb_id_
of a Span
object, or the ent_kb_id
and ent_kb_id_
attributes of a
Token
object.
Tokenization
Tokenization is the task of splitting a text into meaningful segments, called
tokens. The input to the tokenizer is a unicode text, and the output is a
Doc
object. To construct a Doc
object, you need a
Vocab
instance, a sequence of word
strings, and optionally a
sequence of spaces
booleans, which allow you to maintain alignment of the
tokens into the original string.
During processing, spaCy first tokenizes the text, i.e. segments it into
words, punctuation and so on. This is done by applying rules specific to each
language. For example, punctuation at the end of a sentence should be split off
– whereas “U.K.” should remain one token. Each Doc
consists of individual
tokens, and we can iterate over them:
Editable Code
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |
First, the raw text is split on whitespace characters, similar to
text.split(' ')
. Then, the tokenizer processes the text from left to right. On
each substring, it performs two checks:
- Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
- Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.
If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
While punctuation rules are usually pretty general, tokenizer exceptions
strongly depend on the specifics of the individual language. This is why each
available language has its own subclass, like
English
or German
, that loads in lists of hard-coded data and exception
rules.
spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition and ease of alignment into the original string.
After consuming a prefix or suffix, we consult the special cases again. We want the special cases to handle things like “don’t” in English, and we want the same rule to work for “(don’t)!“. We do this by splitting off the open bracket, then the exclamation, then the closed bracket, and finally matching the special case. Here’s an implementation of the algorithm in Python optimized for readability rather than performance:
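The version below is an illustrative sketch rather than spaCy’s exact source; it takes the special cases and the prefix, suffix, infix, token-match and URL-match callables as arguments.

```python
def tokenizer_pseudo_code(text, special_cases, prefix_search, suffix_search,
                          infix_finditer, token_match, url_match):
    # Simplified sketch of the tokenization loop – not spaCy's internal code
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases and token matches always get priority
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
                continue
            if token_match(substring):
                tokens.append(substring)
                substring = ""
                continue
            # Try to consume one prefix, then one suffix, and re-check
            if prefix_search(substring):
                split = prefix_search(substring).end()
                tokens.append(substring[:split])
                substring = substring[split:]
                continue
            if suffix_search(substring):
                split = suffix_search(substring).start()
                suffixes.append(substring[split:])
                substring = substring[:split]
                continue
            # No more affixes: URL match, then special cases, then infixes,
            # otherwise keep the remainder as a single token
            if url_match(substring):
                tokens.append(substring)
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
            elif list(infix_finditer(substring)):
                offset = 0
                for match in infix_finditer(substring):
                    if substring[offset : match.start()]:
                        tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
            else:
                tokens.append(substring)
            substring = ""
        # Suffixes were peeled off right-to-left, so restore their order
        tokens.extend(reversed(suffixes))
    return tokens
```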
The algorithm can be summarized as follows:
- Iterate over space-separated substrings.
- Check whether we have an explicitly defined special case for this substring. If we do, use it.
- Look for a token match. If there is a match, stop processing and keep this token.
- Check whether we have an explicitly defined special case for this substring. If we do, use it.
- Otherwise, try to consume one prefix. If we consumed a prefix, go back to #3, so that the token match and special cases always get priority.
- If we didn’t consume a prefix, try to consume a suffix and then go back to #3.
- If we can’t consume a prefix or a suffix, look for a URL match.
- If there’s no URL match, then look for a special case.
- Look for “infixes” – stuff like hyphens etc. and split the substring into tokens on all infixes.
- Once we can’t consume any more of the string, handle it as a single token.
- Make a final pass over the text to check for special cases that include spaces or that were missed due to the incremental processing of affixes.
Global and language-specific tokenizer data is supplied via the language
data in spacy/lang
. The tokenizer exceptions
define special cases like “don’t” in English, which needs to be split into two
tokens: {ORTH: "do"}
and {ORTH: "n't", NORM: "not"}
. The prefixes, suffixes
and infixes mostly define punctuation rules – for example, when to split off
periods (at the end of a sentence), and when to leave tokens containing periods
intact (abbreviations like “U.S.”).
Tokenization rules that are specific to one language, but can be generalized
across that language, should ideally live in the language data in
spacy/lang
– we always appreciate pull requests!
Anything that’s specific to a domain or text type – like financial trading
abbreviations or Bavarian youth slang – should be added as a special case rule
to your tokenizer instance. If you’re dealing with a lot of customizations, it
might make sense to create an entirely custom subclass.
Adding special case tokenization rules
Most domains have at least some idiosyncrasies that require custom tokenization
rules. These could be certain expressions or abbreviations that are only used in
this specific field. Here’s how to add a special case rule to an existing
Tokenizer
instance:
Editable Code
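For example (the “gimme” rule is just an illustration; ORTH is the spelling of the token text):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")
print([w.text for w in doc])  # ['gimme', 'that']

# Add a special case rule that splits "gimme" into two tokens
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']
```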
The special case doesn’t have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring. The special case rules also have precedence over the punctuation splitting.
Debugging the tokenizer
A working implementation of the pseudo-code above is available for debugging as
nlp.tokenizer.explain(text)
. It returns a list of
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to nlp.tokenizer()
except for whitespace tokens:
Editable Code
Customizing spaCy’s Tokenizer class
Let’s imagine you wanted to create a tokenizer for a new language or specific domain. There are six things you may need to define:
- A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
- A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
- A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
- A function infix_finditer, to handle non-whitespace separators, such as hyphens etc.
- An optional boolean function token_match matching strings that should never be split, overriding the infix rules. Useful for things like numbers.
- An optional boolean function url_match, which is similar to token_match except that prefixes and suffixes are removed before applying the match.
You shouldn’t usually need to create a Tokenizer
subclass. Standard usage is
to use re.compile()
to build a regular expression object, and pass its
.search()
and .finditer()
methods:
Editable Code
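A sketch of a custom tokenizer built this way (the patterns and special case are only examples):

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     url_match=url_re.match)

nlp = spacy.blank("en")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc])  # ['hello', '-', 'world.', ':)']
```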
If you need to subclass the tokenizer instead, the relevant methods to
specialize are find_prefix
, find_suffix
and find_infix
.
Modifying existing rule sets
In many situations, you don’t necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the nlp object’s
Defaults, and the Tokenizer attributes such as
Tokenizer.suffix_search
are writable, so you can
overwrite them with compiled regular expression objects using modified default
rules. spaCy ships with utility functions to help you compile the regular
expressions – for example,
compile_suffix_regex
:
Similarly, you can remove a character from the default suffixes:
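A sketch of the removal case (which default pattern you drop is up to you; the filter shown here is purely illustrative):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")
# Keep every default suffix pattern except the one for a closing bracket
suffixes = [pattern for pattern in nlp.Defaults.suffixes if pattern != r"\]"]
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```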
The Tokenizer.suffix_search
attribute should be a function which takes a
unicode string and returns a regex match object or None
. Usually we use
the .search
attribute of a compiled regex object, but you can use some other
function that behaves the same way.
The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen between
letters as an infix. If you do not want the tokenizer to split on hyphens
between letters, you can modify the existing infix definition from
lang/punctuation.py
:
Editable Code
For an overview of the default regular expressions, see
lang/punctuation.py
and
language-specific definitions such as
lang/de/punctuation.py
for
German.
Hooking a custom tokenizer into the pipeline
The tokenizer is the first component of the processing pipeline and the only one
that can’t be replaced by writing to nlp.pipeline
. This is because it has a
different signature from all the other components: it takes a text and returns a
Doc
, whereas all other components expect to already receive a
tokenized Doc
.
To overwrite the existing tokenizer, you need to replace nlp.tokenizer
with a
custom function that takes a text and returns a Doc
.
Argument | Type | Description |
---|---|---|
text | str | The raw text to tokenize. |
RETURNS | Doc | The tokenized document. |
Example 1: Basic whitespace tokenizer
Here’s an example of the most basic whitespace tokenizer. It takes the shared
vocab, so it can construct Doc
objects. When it’s called on a text, it returns
a Doc
object consisting of the text split on single space characters. We can
then overwrite the nlp.tokenizer
attribute with an instance of our custom
tokenizer.
Editable Code
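A minimal sketch of such a tokenizer – it only splits on single spaces, so punctuation stays attached to the neighboring word:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens caused by consecutive spaces
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # The last token never has a trailing space
        if words:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])
```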
Example 2: Third-party tokenizers (BERT word pieces)
You can use the same approach to plug in any other third-party tokenizers. Your
custom callable just needs to return a Doc
object with the tokens produced by
your tokenizer. In this example, the wrapper uses the BERT word piece
tokenizer, provided by the
tokenizers
library. The tokens
available in the Doc
object returned by spaCy now match the exact word pieces
produced by the tokenizer.
Custom BERT word piece tokenizer
Training with custom tokenization v3.0
spaCy’s training config describes the settings,
hyperparameters, pipeline and tokenizer used for constructing and training the
pipeline. The [nlp.tokenizer]
block refers to a registered function that
takes the nlp
object and returns a tokenizer. Here, we’re registering a
function called whitespace_tokenizer
in the
@tokenizers
registry. To make sure spaCy knows how
to construct your tokenizer during training, you can pass in your Python file by
setting --code functions.py
when you run spacy train
.
functions.py
Registered functions can also take arguments that are then passed in from the
config. This allows you to quickly change and keep track of different settings.
Here, the registered function called bert_word_piece_tokenizer
takes two
arguments: the path to a vocabulary file and whether to lowercase the text. The
Python type hints str
and bool
ensure that the received values have the
correct type.
functions.py
To avoid hard-coding local paths into your config file, you can also set the
vocab path on the CLI by using the --nlp.tokenizer.vocab_file
override when you run
spacy train
. For more details on using registered functions,
see the docs in training with custom code.
Using pre-tokenized text
By default, spaCy assumes that your data is raw text. However,
sometimes your data is partially annotated, e.g. with pre-existing tokenization,
part-of-speech tags, etc. The most common situation is that you have
pre-defined tokenization. If you have a list of strings, you can create a
Doc
object directly. Optionally, you can also specify a list of
boolean values, indicating whether each word is followed by a space.
Editable Code
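For example, a sketch using a blank English pipeline:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # 'Hello, world!'
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
```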
If provided, the spaces list must be the same length as the words list. The
spaces list affects the doc.text
, span.text
, token.idx
, span.start_char
and span.end_char
attributes. If you don’t provide a spaces
sequence, spaCy
will assume that all words are followed by a space. Once you have a
Doc
object, you can write to its attributes to set the
part-of-speech tags, syntactic dependencies, named entities and other
attributes.
Aligning tokenization
spaCy’s tokenization is non-destructive and uses language-specific rules
optimized for compatibility with treebank annotations. Other tools and resources
can sometimes tokenize things differently – for example, "I'm"
→
["I", "'", "m"]
instead of ["I", "'m"]
.
In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
pretrained BERT model and
apply them to spaCy tokens. spaCy’s Alignment
object allows the one-to-one mappings of token indices in both directions as
well as taking into account indices where multiple tokens align to one single
token.
Editable Code
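A sketch of the comparison discussed below, using two tokenizations of the same string:

```python
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # how many b tokens each a token maps to
print(f"a -> b, mapping: {align.x2y.data}")     # e.g. [0, 1, 2, 3, 4, 4, 5, 6]
print(f"b -> a, lengths: {align.y2x.lengths}")
print(f"b -> a, mapping: {align.y2x.data}")
```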
Here are some insights from the alignment information generated in the example above:
- The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they’re also identical in the input: "i", "listened", "to" and "obama".
- The value of x2y.data[6] is 5, which means that other_tokens[6] ("podcasts") aligns to spacy_tokens[5] (also "podcasts").
- x2y.data[4] and x2y.data[5] are both 4, which means that both tokens 4 and 5 of other_tokens ("'" and "s") align to token 4 of spacy_tokens ("'s").
Merging and splitting
The Doc.retokenize
context manager lets you merge and
split tokens. Modifications to the tokenization are stored and performed all at
once when the context manager exits. To merge several tokens into one single
token, pass a Span
to retokenizer.merge
. An
optional dictionary of attrs
lets you set attributes that will be assigned to
the merged token – for example, the lemma, part-of-speech tag or entity type. By
default, the merged token will receive the same attributes as the merged span’s
root.
Editable Code
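For example (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    # Merge "New York" and assign it a custom lemma
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])  # ['I', 'live', 'in', 'New York']
```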
If an attribute in the attrs
is a context-dependent token attribute, it will
be applied to the underlying Token
. For example LEMMA
, POS
or DEP
only apply to a word in context, so they’re token attributes. If an
attribute is a context-independent lexical attribute, it will be applied to the
underlying Lexeme
, the entry in the vocabulary. For example,
LOWER
or IS_STOP
apply to all words of the same spelling, regardless of the
context.
Splitting tokens
The retokenizer.split
method allows splitting
one token into two or more tokens. This can be useful for cases where
tokenization rules alone aren’t sufficient. For example, you might want to split
“its” into the tokens “it” and “is” – but not the possessive pronoun “its”. You
can write rule-based logic that can find only the correct “its” to split, but by
that time, the Doc
will already be tokenized.
This process of splitting a token requires more settings, because you need to
specify the text of the individual tokens, optional per-token attributes and how
the tokens should be attached to the existing syntax tree. This can be done by
supplying a list of heads
– either the token to attach the newly split token
to, or a (token, subtoken)
tuple if the newly split token should be attached
to another subtoken. In this case, “New” should be attached to “York” (the
second split subtoken) and “York” should be attached to “in”.
Editable Code
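A sketch of the split described above (assuming en_core_web_sm, and that “NewYork” is tokenized as a single token):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]  # "New" -> "York", "York" -> "in"
    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["compound", "pobj"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
print("After:", [token.text for token in doc])  # ['I', 'live', 'in', 'New', 'York']
```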
Specifying the heads as a list of token
or (token, subtoken)
tuples allows
attaching split subtokens to other subtokens, without having to keep track of
the token indices after splitting.
Token | Head | Description |
---|---|---|
"New" | (doc[3], 1) | Attach this token to the second subtoken (index 1 ) that doc[3] will be split into, i.e. “York”. |
"York" | doc[2] | Attach this token to doc[1] in the original Doc , i.e. “in”. |
If you don’t care about the heads (for example, if you’re only running the tokenizer and not the parser), you can attach each subtoken to itself:
Overwriting custom extension attributes
If you’ve registered custom
extension attributes,
you can overwrite them during tokenization by providing a dictionary of
attribute names mapped to new values as the "_"
key in the attrs
. For
merging, you need to provide one dictionary of attributes for the resulting
merged token. For splitting, you need to provide a list of dictionaries with
custom attributes, one per split subtoken.
Editable Code
Sentence Segmentation
A Doc
object’s sentences are available via the Doc.sents
property. To view a Doc
’s sentences, you can iterate over the Doc.sents
, a
generator that yields Span
objects. You can check whether a Doc
has sentence boundaries by calling
Doc.has_annotation
with the attribute name
"SENT_START"
.
Editable Code
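For example (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
```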
spaCy provides four alternatives for sentence segmentation:
- Dependency parser: the statistical DependencyParser provides the most accurate sentence boundaries based on full dependency parses.
- Statistical sentence segmenter: the statistical SentenceRecognizer is a simpler and faster alternative to the parser that only sets sentence boundaries.
- Rule-based pipeline component: the rule-based Sentencizer sets sentence boundaries using a customizable list of sentence-final punctuation.
- Custom function: your own custom function added to the processing pipeline can set sentence boundaries by writing to Token.is_sent_start.
Default: Using the dependency parse Needs model
Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a trained pipeline that provides accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box with spaCy’s provided trained pipelines. For social media or conversational text that doesn’t follow the same rules, your application may benefit from a custom trained or rule-based component.
Editable Code
spaCy’s dependency parser respects already set boundaries, so you can preprocess
your Doc
using custom components before it’s parsed. Depending on your text,
this may also improve parse accuracy, since the parser is constrained to predict
parses consistent with the sentence boundaries.
Statistical sentence segmenter v3.0Needs model
The SentenceRecognizer
is a simple statistical
component that only provides sentence boundaries. Along with being faster and
smaller than the parser, its primary advantage is that it’s easier to train
because it only requires annotated sentence boundaries rather than full
dependency parses. spaCy’s trained pipelines include both a parser
and a trained sentence segmenter, which is
disabled by default. If you only need
sentence boundaries and no parser, you can use the exclude
or disable
argument on spacy.load
to load the pipeline
without the parser and then enable the sentence recognizer explicitly with
nlp.enable_pipe
.
Editable Code
Rule-based pipeline component
The Sentencizer
component is a
pipeline component that splits sentences on
punctuation like .
, !
or ?
. You can plug it into your pipeline if you only
need sentence boundaries without dependency parses.
Editable Code
Custom rule-based strategy
If you want to implement your own strategy that differs from the default
rule-based approach of splitting on sentences, you can also create a
custom pipeline component that
takes a Doc
object and sets the Token.is_sent_start
attribute on each
individual token. If set to False
, the token is explicitly marked as not the
start of a sentence. If set to None
(default), it’s treated as a missing value
and can still be overwritten by the parser.
Here’s an example of a component that implements a pre-processing rule for
splitting on "..."
tokens. The component is added before the parser, which is
then used to further segment the text. That’s possible, because is_sent_start
is only set to True
for some of the tokens – all others still specify None
for unset sentence boundaries. This approach can be useful if you want to
implement additional rules specific to your data, while still being able to
take advantage of dependency-based sentence segmentation.
Editable Code
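A sketch of such a component (assuming en_core_web_sm; the "..." rule is just an illustration):

```python
import spacy
from spacy.language import Language

text = "this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after every "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```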
Mappings & Exceptions v3.0
The AttributeRuler
manages rule-based mappings and
exceptions for all token-level attributes. As the number of
pipeline components has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the AttributeRuler
provides a single component with a unified
pattern format for all token attribute mappings and exceptions.
The AttributeRuler
uses
Matcher
patterns to identify
tokens and then assigns them the provided attributes. If needed, the
Matcher
patterns can include context around the target token.
For example, the attribute ruler can:
- provide exceptions for any token attributes
- map fine-grained tags to coarse-grained tags for languages without statistical morphologizers (replacing the v2.x tag_map in the language data)
- map token surface form + fine-grained tags to morphological features (replacing the v2.x morph_rules in the language data)
- specify the tags for space tokens (replacing hard-coded behavior in the tagger)
The following example shows how the tag and POS NNP
/PROPN
can be specified
for the phrase "The Who"
, overriding the tags provided by the statistical
tagger and the POS tag map.
Editable Code
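A sketch of this exception (assuming en_core_web_sm; the exact tags the statistical tagger predicts before the rules are applied may vary):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # e.g. DT DET
print(doc1[3].tag_, doc1[3].pos_)  # e.g. WP PRON

# Add rules to the attribute ruler for both tokens of "The Who"
ruler = nlp.get_pipe("attribute_ruler")
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
attrs = {"TAG": "NNP", "POS": "PROPN"}
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
```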
Word vectors and semantic similarity
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
banana.vector
Pipeline packages that come with built-in word vectors make them available as
the Token.vector
attribute.
Doc.vector
and Span.vector
will
default to an average of their token vectors. You can also check if a token has
a vector assigned, and get the L2 norm, which can be used to normalize vectors.
Editable Code
The words “dog”, “cat” and “banana” are all pretty common in English, so they’re
part of the pipeline’s vocabulary, and come with a vector. The word “afskfsd” on
the other hand is a lot less common and out-of-vocabulary – so its vector
representation consists of 300 dimensions of 0
, which means it’s practically
nonexistent. If your application will benefit from a large vocabulary with
more vectors, you should consider using one of the larger pipeline packages or
loading in a full vector package, for example,
en_core_web_lg
, which includes 685k unique
vectors.
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.
Each Doc
, Span
, Token
and
Lexeme
comes with a .similarity
method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective – whether two words, spans
or documents are similar really depends on how you’re looking at it. spaCy’s
similarity implementation usually assumes a pretty general-purpose definition of
similarity.
Editable Code
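For example, assuming a pipeline with word vectors such as en_core_web_md is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # a pipeline that ships with vectors
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```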
What to expect from similarity results
Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:
- There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” is similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
- The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
- Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
Adding word vectors
Custom word vectors can be trained using a number of open-source libraries, such
as Gensim, FastText,
or Tomas Mikolov’s original
Word2vec implementation. Most
word vector libraries output an easy-to-read text-based format, where each line
consists of the word followed by its vector. For everyday use, we want to
convert the vectors into a binary format that loads faster and takes up less
space on disk. The easiest way to do this is the
init vectors
command-line utility. This will output a
blank spaCy pipeline in the directory /tmp/la_vectors_wiki_lg
, giving you
access to some nice Latin vectors. You can then pass the directory path to
spacy.load
or use it in the
[initialize]
of your config when you
train a model.
To help you strike a good balance between coverage and memory usage, spaCy’s
Vectors
class lets you map multiple keys to the same
row of the table. If you’re using the
spacy init vectors
command to create a vocabulary,
pruning the vectors will be taken care of automatically if you set the --prune
flag. You can also do it manually in the following steps:
- Start with a word vectors package that covers a huge vocabulary. For instance, the en_core_web_lg package provides 300-dimensional GloVe vectors for 685k terms of English.
- If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
- Call Vocab.prune_vectors with the number of vectors you want to keep.
Vocab.prune_vectors
reduces the current vector
table to a given number of unique entries, and returns a dictionary containing
the removed words, mapped to (string, score)
tuples, where string
is the
entry the removed word was mapped to and score
the similarity score between
the two words.
Removed words
In the example above, the vector for “Shore” was removed and remapped to the
vector of “coast”, which is deemed about 73% similar. “Leaving” was remapped to
the vector of “leaving”, which is identical. If you’re using the
init vectors
command, you can set the --prune
option to easily reduce the size of the vectors as you add them to a spaCy
pipeline:
This will create a blank spaCy pipeline with vectors for the first 10,000 words in the vectors. All other words in the vectors are mapped to the closest vector among those retained.
Adding vectors individually
The vector
attribute is a read-only numpy or cupy array (depending on
whether you’ve configured spaCy to use GPU memory), with dtype float32
. The
array is read-only so that spaCy can avoid unnecessary copy operations where
possible. You can modify the vectors via the Vocab
or
Vectors
table. Using the
Vocab.set_vector
method is often the easiest approach
if you have vectors in an arbitrary format, as you can read in the vectors with
your own logic, and just set them with a simple loop. This method is likely to
be slower than approaches that work with the whole vectors table at once, but
it’s a great approach for once-off conversions before you save out your nlp
object to disk.
Adding vectors
Language Data
Every language is different – and usually full of exceptions and special
cases, especially amongst the most common words. Some of these exceptions are
shared across languages, while others are entirely specific – usually so
specific that they need to be hard-coded. The
lang
module contains all language-specific data,
organized in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
emoticons and single-letter abbreviations. The individual language data in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the
Language
subclass – for example, English
or German
. The
values are defined in the Language.Defaults
.
Name | Description |
---|---|
Stop wordsstop_words.py | List of most common words of a language that are often useful to filter out, for example “and” or “I”. Matching tokens will return True for is_stop . |
Tokenizer exceptionstokenizer_exceptions.py | Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”. |
Punctuation rulespunctuation.py | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
Character classeschar_classes.py | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
Lexical attributeslex_attrs.py | Custom functions for setting lexical attributes on tokens, e.g. like_num , which includes language-specific words like “ten” or “hundred”. |
Syntax iteratorssyntax_iterators.py | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. |
Lemmatizerlemmatizer.py spacy-lookups-data | Custom lemmatizer implementation and lemmatization tables. |
Creating a custom language subclass
If you want to customize multiple components of the language data or add support
for a custom language or domain-specific “dialect”, you can also implement your
own language subclass. The subclass should define two attributes: the lang
(unique language code) and the Defaults
defining the language data. For an
overview of the available attributes that can be overwritten, see the
Language.Defaults
documentation.
Editable Code
The @spacy.registry.languages
decorator lets you
register a custom language class and assign it a string name. This means that
you can call spacy.blank
with your custom
language name, and even train pipelines with it and refer to it in your
training config.
Registering a custom language